![]() | |
Aglomera.NET |
A hierarchical agglomerative clustering (HAC) library written in C#
Aglomera is a .NET open-source library written entirely in C# that implements hierarchical clustering (HC) algorithms. A cluster refers to a set of instances or data-points. HC can either be agglomerative (bottom-up approach) or divisive (top-down approach). The distance between each instance is calculated using some dissimilarity function. The distance between clusters is calculated using some linkage criterion. Each step of HC produces a new cluster-set, i.e., a set of clusters, from the cluster-set of the previous step.
Agglomerative HC starts with a cluster-set in which each instance belongs to its own cluster. At each step, it merges the two closest clusters, until all clusters have been merged into a single cluster containing all instances, i.e., it ends with a cluster-set containing a single cluster with all instances. Divisive HC works in reverse - it starts by having a cluster-set with one cluster containing all instances. At each step, it splits clusters recursively, using some splitting method, until reaching one cluster-set containing only singletons, i.e., where each instance is placed in its own cluster.
The clustering result is a list containing the cluster-set and the corresponding dissimilarity / distance at which it was created at each step of the algorithm. The result is organized in a hierarchical form, i.e., where each cluster references either the two parents that were merged for its creation (in the agglomerative approach), or the two children resulting from splitting the cluster (in the divisive approach). Due to their hierarchical nature, clustering results can be visualized via a dendrogram.
Currently, Aglomera.NET implements program AGNES (AGglomerative NESting) of [Kaufman & Rousseeuw, 1990], i.e., the bottom-up approach, the It supports different linkage criteria and also provides several metrics to perform internal and external evaluation of clustering results. The results of clustering can be exported to a Json file to be visualized as a dendrogram in DendrogramViewer, an interactive web-application using D3.js.
Aglomera.NET is open-source under the MIT license and is free for commercial use.
Supported platforms:
The following packages with the corresponding dependencies are provided:
Aglomera: core package, including clustering algorithm, linkage criteria and evaluation metrics.
Aglomera.D3: package to export clustering results to Json files to be visualized with D3.js.
You can git clone the Aglomera.NET source code and use an IDE like VisualStudio to build the corresponding binaries.
Consider the following data-set example taken from [Kaufman & Rousseeuw, 1990]:
where colors indicate the "real" instance class, i.e., either 'A=red' or 'B=blue'.
Start by defining a data-point class, for example one to represent points in a 2D Euclidean space, such as:
class DataPoint : IComparable<DataPoint> { public DataPoint(string id, double x, double y) { ... } public int CompareTo(DataPoint other) { ... } ... }
and then define a dissimilarity metric for this type:
class DssimilarityMetric : IDissimilarityMetric<DataPoint> { public double Calculate(DataPoint instance1, DataPoint instance2) { ... } }
We can then define the data-set by using:
var dataPoints = new HashSet<DataPoint>( new[] { new DataPoint("1", 2.00, 2.00), new DataPoint("2", 5.50, 4.00), new DataPoint("3", 5.00, 5.00), new DataPoint("4", 1.50, 2.50), new DataPoint("5", 1.00, 1.00), new DataPoint("6", 7.00, 5.00), new DataPoint("7", 5.75, 6.50) });
We now select a linkage criterion and create the clustering algorithm:
var metric = new DissimilarityMetric(); var linkage = new AverageLinkage<DataPoint>(metric); var algorithm = new AgglomerativeClusteringAlgorithm<DataPoint>(linkage);
The clustering result is then obtained by simply executing:
var clusteringResult = algorithm.GetClustering(dataPoints);
Enumerating the result (a ClusteringResult<DataPoint> object) yields the following:
[0] {0.000 {(1), (2), (3), (4), (5), (6), (7)}} [1] {0.707 {(2), (3), (5), (6), (7), (1;4)}} [2] {1.118 {(5), (6), (7), (1;4), (2;3)}} [3] {1.498 {(6), (7), (2;3), (1;4;5)}} [4] {1.901 {(7), (1;4;5), (2;3;6)}} [5] {2.047 {(1;4;5), (2;3;6;7)}} [6] {5.496 {(1;4;5;2;3;6;7)}}
from which we can select the appropriate data-set, e.g., according to the number of clusters, the distance, external criteria, etc.
Supports the following linkage criteria, used to consider the dissimilarity between clusters:
Provides the following external clustering evaluation criteria, used to evaluate the quality of a given cluster-set when each data-point has associated a certain label / class:
Purity, normalized mutual information, accuracy, precision, recall, F-measure.
To externally-evaluate the clustering result, start by indicating the class of each data-point, e.g., a char, and an evaluation criterion:
var pointClasses = new Dictionary<DataPoint, char>{...}; var criterion = new NormalizedMutualInformation<DataPoint, char>();
The evaluation score of the 5th cluster-set is given by executing:
var score = criterion.Evaluate(clusteringResult[5], pointClasses);
Provides the following internal clustering evaluation criteria, used to select the optimal number of clusters when no ground truth is available:
Silhouette coefficient, Dunn index, Davies-Bouldin index, Calinski-Harabasz index, Modified Gamma statistic, Xie-Beni index, within-between ratio, I-index, Xu index, RMSSD, R-squared.
To internally-evaluate the clustering result, we simply choose an evaluation criterion and calculate the score:
var criterion = new SilhouetteCoefficient<DataPoint>(metric); var score = criterion.Evaluate(clusteringResult[5]);
CSV export
To export the result of clustering to a comma-separated values (CSV) file, we simply do:
clusteringResult.SaveToCsv(FILE_PATH);
which would produce a CSV file with the contents of each cluster in the cluster-set of each step of the algorithm, one instance per line.
D3.js export
Export the result of clustering to a Json file that contains the hierarchical structure of the clustering procedure that can be loaded into DendrogramViewer to produce a dendrogram, e.g.:
using Aglomera.D3;
...
clusteringResult.SaveD3DendrogramFile(fullPath, formatting: Formatting.Indented);
would produce Json text like the following:
{ "n": "(1;4;5;2;3;6;7)", "d": 5.5, "c": [ { "n": "(2;3;6;7)", "d": 2.05, "c": [ { "n": "(2;3;6)", "d": 1.9, "c": [ { "n": "(2;3)", "d": 1.12, "c": [ { "n": "(3)", "d": 0.0, "c": [] }, { "n": "(2)", "d": 0.0, "c": [] } ] }, { "n": "(6)", "d": 0.0, "c": [] } ] }, { "n": "(7)", "d": 0.0, "c": [] } ] }, { "n": "(1;4;5)", "d": 1.5, "c": [ { "n": "(1;4)", "d": 0.71, "c": [ { "n": "(4)", "d": 0.0, "c": [] }, { "n": "(1)", "d": 0.0, "c": [] } ] }, { "n": "(5)", "d": 0.0, "c": [] } ] } ] }
where n holds the name or id of the cluster, d is the dissimilarity / distance at which it was found and created, and c contains the list containing the pair of parents or children of the cluster.
When loaded in DendrogramViewer, this would produce the following dendrogram:
Example code can be found in the src/Examples folder in the repository. Several open-source data-sets adapted to work with the example applications can be found in src/Examples/datasets.
References
Other links
Copyright © 2018, Pedro Sequeira